Self-Taught Hashing for Fast Similarity Search 
ABSTRACT

The ability to perform fast similarity search at large scale is
of great importance to many Information Retrieval (IR) applications.
A promising way to accelerate similarity search is semantic 
hashing which designs compact binary codes for a large num- 
ber of documents so that semantically similar documents 
are mapped to similar codes (within a short Hamming distance).
Although some recently proposed techniques are able to
generate high-quality codes for documents known in advance,
obtaining the codes for previously unseen documents remains
a very challenging problem. In this paper, we emphasise this
issue and propose a novel Self-Taught Hashing (STH) approach
to semantic hashing: we first find the optimal $l$-bit binary
codes for all documents in the given corpus via unsupervised
learning, and then train $l$ classifiers via supervised learning
to predict the $l$-bit code for any previously unseen query
document. Our experiments on
three real-world text datasets show that the proposed ap- 
proach using binarised Laplacian Eigenmap (LapEig) and 
linear Support Vector Machine (SVM) outperforms state- 
of-the-art techniques significantly. 
1. INTRODUCTION

The problem of similarity search (also known as nearest neighbour
search) is: given a query document^1, find its most similar
documents from a very large document collection (corpus).
It is of great importance to many Information Retrieval (IR)
[30] applications, such as near-duplicate detection [18],
plagiarism analysis 03], collaborative filtering [26], caching [32],
and content-based multimedia retrieval [28].

Recently, with the rapid evolution of the Internet and the 
increased amounts of data to be processed, how to conduct 
fast similarity search at large scale has become an urgent re- 
search issue. A promising way to accelerate similarity search 
is semantic hashing [34] which designs compact binary codes 
for a large number of documents so that semantically simi- 
lar documents are mapped to similar codes (within a short 
Hamming distance). It is extremely fast to perform similarity
search over such binary codes [42], because

• the encoded data are highly compressed and thus can
be loaded into the main memory;

• the Hamming distance between two binary codes can
be computed efficiently by using bit XOR operation
and counting the number of set bits [25][46]: an ordinary
PC today would be able to do millions of Hamming
distance computations in just a few milliseconds.
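The XOR-and-count step can be illustrated with a minimal Python sketch. It assumes each binary code is stored as a plain integer; the function name `hamming_distance` and the two example codes are illustrative only and do not come from the paper.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two codes stored as integers."""
    # XOR leaves a 1 exactly where the two codes disagree;
    # counting the set bits then gives the Hamming distance.
    return bin(code_a ^ code_b).count("1")


# Two 8-bit codes that differ in exactly three positions.
query = 0b10110010
other = 0b10011011
print(hamming_distance(query, other))  # 3
```

On hardware with a population-count instruction the same two steps map to a couple of machine operations per code word, which is what makes scanning millions of codes cheap.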

Furthermore, we usually just need to retrieve a small number 
of the most similar documents (i.e., nearest neighbours) for 
a given query document rather than computing its similarity 
to all documents in the collection. In such situations, we can 
simply return all the documents that are hashed into a tight 

^1 In similarity search, a document is used as the query for
retrieval, which is fundamentally different from the standard
keyword search paradigm, e.g., in TREC.



Hamming ball centred around the binary code of the query 
document. For example, assuming that we use 4-bit binary 
codes, if the query document is represented as '0000', then 
we can just check this code as well as the 4 codes within
Hamming distance one of it (i.e., differing from it in one
bit), namely '1000', '0100', '0010', and '0001', and return
the associated documents. It will also be easy to filter
or re-rank the very small set of "good" documents (returned
by semantic hashing) based on their full content, so as to
further improve the retrieval effectiveness with just a little
extra time [42].
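The probing of a Hamming ball described here can be sketched as follows. This is an illustrative enumeration, not the authors' implementation; the helper names `hamming_ball` and `ball_size` are hypothetical, and codes are again assumed to be stored as integers.

```python
from itertools import combinations
from math import comb


def hamming_ball(code: int, n_bits: int, radius: int):
    """Yield every n_bits-long code within the given Hamming distance of code."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p  # flip bit p
            yield flipped


def ball_size(n_bits: int, radius: int) -> int:
    """Number of codes in the ball: sum of C(n_bits, r) for r = 0..radius."""
    return sum(comb(n_bits, r) for r in range(radius + 1))


# The 4-bit example from the text: the query code plus its 4 one-bit neighbours.
codes = [format(c, "04b") for c in hamming_ball(0b0000, n_bits=4, radius=1)]
print(codes)
print(ball_size(64, 1))  # number of probes for 64-bit codes at radius 1
```

Each yielded code would be looked up in a hash table keyed by code. Because the number of probes grows combinatorially with the radius, only small radii are practical.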

In addition, similarity search serves as the basis of a clas- 
sic non-parametric machine learning method, the k-Nearest- 
Neighbours (kNN) algorithm [31], for automated text
categorisation [37] and so on. By enabling fast similarity search
at large scale, semantic hashing makes it feasible to exploit 
"the unreasonable effectiveness of data" [14] to accomplish 
traditionally difficult tasks. For example, researchers re- 
cently achieved great success in scene completion and scene 
recognition using millions of images on the Web as training 
data [15][44].

Although some recently proposed techniques are able to 
generate high-quality codes for the documents known in ad- 
vance, obtaining the codes for previously unseen documents 
remains a very challenging problem [42]. Existing
methods either have prohibitively high computational
complexity or impose exceedingly restrictive assumptions about
data distribution (see Section 3.2). In this paper, we
emphasise this issue and propose a novel Self-Taught Hashing
(STH) approach to semantic hashing. As illustrated in
Figure 1, we first find the optimal $l$-bit binary codes for all
documents in the given corpus via unsupervised learning,
and then train $l$ classifiers via supervised learning to predict
the $l$-bit code for any previously unseen query document.

Our experiments on three real-world text datasets show 
that the proposed approach using binarised Laplacian Eigen- 
map (LapEig) [3] and linear Support Vector Machine (SVM) 
[23][36] outperforms state-of-the-art techniques significantly,
while maintaining a high running speed. 

The rest of this paper is organised as follows. In Section 2,
we review the related work. In Section 3, we present our
approach in detail. In Section 4, we show the experimental
results. In Section 5, we draw conclusions.

2. RELATED WORK 

There has been extensive research on fast similarity search 
due to its central importance in many applications. For 
a low-dimensional feature space, similarity search can be 
carried out efficiently with pre-built space-partitioning in- 
dex structures (such as KD-tree) or data-partitioning index 
structures (such as R-tree) [7]. However, when the
dimensionality of feature space is high (say > 10), similarity search
aiming to return exact results cannot be done better than 
the naive method — a linear scan of the entire collection 
[45]. In the IR domain, documents are typically represented
as feature vectors in a space of thousands of dimensions or
more [30]. Nevertheless, if the complete exactness of
results is not really necessary, similarity search in a high- 
dimensional space can be dramatically speeded up by using 
hash-based methods which are purposefully designed to ap- 
proximately answer queries in virtually constant time [42] . 

Such hash-based methods for fast similarity search can be 
considered as a means for embedding high-dimensional feature
vectors into a low-dimensional Hamming space (the set
of all $2^l$ binary strings of length $l$), while retaining as much
as possible the semantic similarity structure of data. Unlike
standard dimensionality reduction techniques such as Latent
Semantic Indexing (LSI) [5, 8] and Locality-Preserving
Indexing (LPI) [17][16], hashing techniques map feature
vectors to binary codes, which is key to extremely fast
similarity search (see Section 1). One possible way to get
binary codes for text documents is to binarise the real-valued
low-dimensional vectors (obtained from dimensionality
reduction techniques like LSI) via thresholding [34]. An
improvement on binarised-LSI that directly optimises a
Hamming distance based objective function, namely Laplacian
Co-Hashing (LCH), has been proposed recently [50].

The most well-known hashing technique that preserves 
similarity information is probably Locality-Sensitive
Hashing (LSH) [1]. LSH simply employs random linear
projections (followed by random thresholding) to map data points
close in a Euclidean space to similar codes. It is theoretically
guaranteed that as the code length increases, the Hamming
distance between two codes will asymptotically approach the
Euclidean distance between their corresponding data points.
However, since the design of hash functions for LSH is
data-oblivious, LSH may lead to quite inefficient (long) codes in
practice [34][48].

Several recently proposed hashing techniques attempt to 
overcome this problem by finding good data-aware hash 
functions through machine learning. In [34], the authors
proposed to use stacked Restricted Boltzmann Machine (RBM)
[19][20], and showed that it was indeed able to generate
compact binary codes to accelerate document retrieval.
Researchers have also tried the boosting approach to Similarity
Sensitive Coding (SSC) [38] and Forgiving Hashing (FgH)
[2]: they first train AdaBoost [35] classifiers with similar
pairs of items as positive examples (and also non-similar
pairs of items as negative examples in SSC), and then take
the output of all (decision stump) weak learners on a given
document as its binary code. In [44], both stacked-RBM and
boosting-SSC were found to work significantly better and
faster than LSH when applied to a database containing tens
of millions of images. In [48], a new technique called Spectral
Hashing (SpH) was proposed. It has demonstrated significant
improvements over LSH, stacked-RBM and boosting-SSC
in terms of the number of bits required to find good
similar items. There is some resemblance between the first
step of SpH and the unsupervised learning stage of our STH
approach, because both are related to spectral graph
partitioning [6, 13, 40]. Nevertheless, we use a different spectral
method and take a different way to address the entropy
maximising criterion (see Section 3.1). More importantly, in
order to process query documents, SpH has to assume that the
data are uniformly distributed in a hyper-rectangle, which is
apparently very restrictive. In contrast, our proposed STH
approach can work with any data distribution and it is much
more flexible (see Section 3.2). The superiority of STH to
SpH has been confirmed by our experimental results (see
Section 4).

A somewhat related, but different, line of research is to 
use hashing representations for machine learning [41][47].
The objective of such techniques is to accelerate complex 
learning algorithms, but not similarity search. Our work is 
basically the other way around. 
Figure 1: The proposed STH approach to semantic hashing.

3. APPROACH

The proposed Self-Taught Hashing (STH) approach to
semantic hashing is a general learning framework that consists
of two distinct stages, as illustrated in Figure 1. We call the
approach "self-taught" because the hash function is learnt
from the data that are auto-labelled by itself in the previous
stage^2.

3.1 Stage 1: Unsupervised Learning of Binary Codes

Given a collection of n documents which are represented
as m-dimensional vectors $\{\mathbf{x}_i\}_{i=1}^{n} \subset \mathbb{R}^m$, let $X$ denote the
$m \times n$ term-document matrix $[\mathbf{x}_1, \ldots, \mathbf{x}_n]$. Suppose that
the desired length of code is $l$ bits. We use $\mathbf{y}_i \in \{-1,+1\}^l$
to represent the binary code for document vector $\mathbf{x}_i$, where
the p-th element of $\mathbf{y}_i$, i.e., $y_i^{(p)}$, is $+1$ if the p-th bit of code
is on, or $-1$ otherwise. Let $Y$ denote the $n \times l$ matrix whose
i-th row is the code for the i-th document, i.e., $Y = [\mathbf{y}_1, \ldots, \mathbf{y}_n]^T$.

A "good" semantic hashing should be similarity preserv- 
ing to ensure effectiveness. That is to say, semantically sim- 
ilar documents should be mapped to similar codes within a 
short Hamming distance. 

Unlike the existing approaches (such as SpH [48]) that
aim to preserve the global similarity structure of all
document pairs, we focus on the local similarity structure, i.e.,
the k-nearest-neighbourhood, for each document. Since IR
applications usually put emphasis on a small number of most
similar documents for a given query document [30], preserving
the global similarity structure is not only unnecessary
but also likely to be sub-optimal for our problem. Therefore,
using the cosine similarity^3 [30], we construct our $n \times n$ local



^2 It is, however, worth noticing that the term "self-taught
learning" has been mentioned in [33] where the intention was
to describe a strategy for transfer learning based on sparse 
coding, whereas in this paper the term has a rather different 
meaning. 

^3 Our approach can work with any legitimate similarity
measure, though we focus on cosine similarity in this paper.



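similarity matrix $W$ as

$$W_{ij} = \begin{cases} \dfrac{\mathbf{x}_i^T \mathbf{x}_j}{\|\mathbf{x}_i\| \, \|\mathbf{x}_j\|} & \text{if } \mathbf{x}_i \in N_k(\mathbf{x}_j) \text{ or } \mathbf{x}_j \in N_k(\mathbf{x}_i) \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$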

where $N_k(\mathbf{x})$ represents the set of k-nearest-neighbours of
document $\mathbf{x}$. In other words, $W$ is the adjacency matrix of
the k-nearest-neighbours graph for the given corpus [5]. A
by-product of focusing on such a local similarity structure
instead of the global one is that $W$ becomes a sparse matrix.
This not only leads to much lower storage overhead,
but also brings a significant reduction in the computational
complexity of subsequent operations. Furthermore, we
introduce a diagonal $n \times n$ matrix $D$ whose entries are given
by $D_{ii} = \sum_{j=1}^{n} W_{ij}$. The matrix $D$ provides a natural
measure of document importance: the bigger the value of $D_{ii}$ is,
the more "important" is the document $\mathbf{x}_i$ as its neighbours
are strongly connected to it [3].
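The construction of W and D can be sketched in pure Python. This is a brute-force illustration on dense toy vectors, not the authors' code: the function names are hypothetical, a document is assumed not to count as its own neighbour, and a real corpus would use sparse vectors and a sparse matrix.

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_similarity_matrix(docs, k):
    """Build the local similarity matrix W and the diagonal of D by brute force."""
    n = len(docs)
    sim = [[cosine(docs[i], docs[j]) for j in range(n)] for i in range(n)]
    # neighbours[i] holds the indices of the k documents most similar to i
    # (assumption: a document is not its own neighbour).
    neighbours = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: -sim[i][j])
        neighbours.append(set(others[:k]))
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Keep the similarity if either document is a neighbour of the other.
            if i != j and (j in neighbours[i] or i in neighbours[j]):
                W[i][j] = sim[i][j]
    degree = [sum(row) for row in W]  # D_ii = sum_j W_ij
    return W, degree


docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
W, degree = knn_similarity_matrix(docs, k=1)
```

With k = 1 on this toy corpus only the two tightly related pairs are linked, so most entries of W are zero, which is the sparsity the text refers to.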

The Hamming distance between two binary codes $\mathbf{y}_i$ and
$\mathbf{y}_j$ (corresponding to documents $\mathbf{x}_i$ and $\mathbf{x}_j$) is given by the
number of bits that are different between them, which can
be calculated as $\frac{1}{4}\|\mathbf{y}_i - \mathbf{y}_j\|^2$. To meet the similarity
preserving criterion, we seek to minimise the weighted average
Hamming distance (as in SpH [48])

$$\frac{1}{4} \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij} \|\mathbf{y}_i - \mathbf{y}_j\|^2 \qquad (2)$$

because it incurs a heavy penalty if two similar documents
are mapped far apart. After some simple mathematical
transformation, the above objective function can be rewritten
in matrix form as $\frac{1}{2}\mathrm{Tr}(Y^T L Y)$, where $L = D - W$ is
the graph Laplacian [6], and $\mathrm{Tr}(\cdot)$ means the matrix trace.
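The transformation into matrix form can be spelled out as follows. This is a sketch that relies only on the symmetry of $W$ (which follows from Eq. (1)) and on $D_{ii} = \sum_j W_{ij}$:

$$
\begin{aligned}
\sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}\,\|\mathbf{y}_i-\mathbf{y}_j\|^2
&= \sum_{i,j} W_{ij}\left(\mathbf{y}_i^T\mathbf{y}_i + \mathbf{y}_j^T\mathbf{y}_j - 2\,\mathbf{y}_i^T\mathbf{y}_j\right) \\
&= 2\sum_{i} D_{ii}\,\mathbf{y}_i^T\mathbf{y}_i - 2\sum_{i,j} W_{ij}\,\mathbf{y}_i^T\mathbf{y}_j \\
&= 2\,\mathrm{Tr}(Y^T D Y) - 2\,\mathrm{Tr}(Y^T W Y) \;=\; 2\,\mathrm{Tr}(Y^T L Y).
\end{aligned}
$$

Multiplying by the factor $\frac{1}{4}$ that converts squared Euclidean distance between $\pm 1$ codes into Hamming distance gives $\frac{1}{2}\mathrm{Tr}(Y^T L Y)$. The constant does not affect the minimiser, so only the trace term matters for the optimisation.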

We found that the above objective function (2) is actually
proportional to that of a well-known manifold learning
algorithm, Laplacian Eigenmap (LapEig), except that LapEig
does not have the constraint $\mathbf{y}_i \in \{-1,+1\}^l$. So, if we relax
this discreteness condition but just keep the similarity
preserving requirement, we can get the optimal $l$-dimensional
real-valued vector $\tilde{\mathbf{y}}_i$ to represent each document $\mathbf{x}_i$ by
solving the following LapEig problem:

$$\arg\min_{\tilde{Y}} \; \mathrm{Tr}(\tilde{Y}^T L \tilde{Y}) \qquad (3)$$

$$\text{subject to} \quad \tilde{Y}^T D \tilde{Y} = I, \quad \tilde{Y}^T D \mathbf{1} = \mathbf{0}$$

where $\mathrm{Tr}(\tilde{Y}^T L \tilde{Y})$ gives the real relaxation of the weighted
average Hamming distance, which is proportional to
$\mathrm{Tr}(Y^T L Y)$, and the two constraints prevent the collapse
into a subspace of dimension less than $l$. The solution of this
optimisation problem is given by $\tilde{Y} = [\mathbf{v}_1, \ldots, \mathbf{v}_l]$ whose
columns are the $l$ eigenvectors corresponding to the smallest
eigenvalues of the following generalised eigenvalue problem
(except the trivial eigenvalue 0):

$$L\mathbf{v} = \lambda D \mathbf{v} \qquad (4)$$

The above LapEig formulation (3) may look similar to the
first step of SpH [48]. This is because SpH is motivated by
a spectral graph partitioning method, ratio-cut [13], while
LapEig is closely connected to another spectral graph
partitioning method, normalised-cut [40]. Many independent
studies have shown that normalised-cut has better theoretical
properties and empirical performance than ratio-cut [6l SO].
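As a rough illustration of the generalised eigenvalue problem (4), the sketch below solves it with a dense NumPy solver on a four-node toy graph. It is not the authors' Matlab code: the function name is hypothetical, every document is assumed to have at least one neighbour (so that D is invertible), the graph is assumed to be connected, and a large corpus would use a sparse Lanczos-type solver instead.

```python
import numpy as np


def laplacian_eigenmap(W: np.ndarray, n_dims: int):
    """Return the n_dims non-trivial eigenpairs of L v = lambda D v (dense solver)."""
    degree = W.sum(axis=1)
    L = np.diag(degree) - W
    # Substituting v = D^{-1/2} u turns the generalised problem into an ordinary
    # symmetric eigenproblem for D^{-1/2} L D^{-1/2}, which eigh can solve.
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigenvalues, U = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    V = d_inv_sqrt[:, None] * U             # columns satisfy V^T D V = I
    # Skip the first eigenpair (eigenvalue 0, constant eigenvector).
    return eigenvalues[1:n_dims + 1], V[:, 1:n_dims + 1]


# Toy graph: two tightly linked pairs joined by one weak edge.
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
eigenvalues, Y_real = laplacian_eigenmap(W, n_dims=1)
```

On this toy graph the first non-trivial eigenvector takes one sign on documents 0 and 1 and the opposite sign on documents 2 and 3, so a single bit already separates the two tightly linked pairs.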

We now convert the above $l$-dimensional real-valued vectors
$\tilde{\mathbf{y}}_1, \ldots, \tilde{\mathbf{y}}_n$ into binary codes via thresholding: if the
p-th element of $\tilde{\mathbf{y}}_i$ is larger than the specified threshold,
$y_i^{(p)} = +1$ (i.e., the p-th bit of the i-th code is on); otherwise,
$y_i^{(p)} = -1$ (i.e., the p-th bit of the i-th code is off).

A "good" semantic hashing should also be entropy maximising
to ensure efficiency, as pointed out by [2]. According
to information theory [39], the maximal entropy of
a source alphabet is attained by having a uniform probability
distribution. If the entropy of codes over the corpus is
small, it means that documents are mapped to only a small
number of codes (hash bins), thereby rendering the hash table
inefficient. To meet this entropy maximising criterion,
we set the threshold for binarising $\tilde{y}_1^{(p)}, \ldots, \tilde{y}_n^{(p)}$ to be the
median value of $\mathbf{v}_p$. In this way, the p-th bit will be on for
half of the corpus and off for the other half. Furthermore, as
the eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_l$ given by LapEig are orthogonal
to each other, different bits $y^{(1)}, \ldots, y^{(l)}$ in the generated
binary codes will be uncorrelated. Therefore this thresholding
method gives each distinct binary code roughly equal
probability of occurring in the document collection, thus achieving
the best utilisation of the hash table.
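The median-based binarisation can be sketched in a few lines of Python. The function name and the toy embedding are illustrative assumptions; each row stands for the real-valued vector of one document, and the output uses the same +1/-1 convention as the codes in the text.

```python
from statistics import median


def binarise_at_median(embedding):
    """Threshold each dimension of an n x l embedding at its own median,
    so that every bit is on (+1) for about half of the corpus."""
    n_dims = len(embedding[0])
    thresholds = [median(row[p] for row in embedding) for p in range(n_dims)]
    return [[1 if row[p] > thresholds[p] else -1 for p in range(n_dims)]
            for row in embedding]


# Toy 2-dimensional embedding of four documents.
embedding = [[-0.9, 0.2], [-0.4, -0.7], [0.3, 0.6], [0.8, -0.1]]
codes = binarise_at_median(embedding)
print(codes)  # [[-1, 1], [-1, -1], [1, 1], [1, -1]]
```

In this toy case each bit is on for exactly two of the four documents and all four 2-bit codes occur once, which is the balanced use of hash bins that the entropy maximising criterion asks for.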

3.2 Stage 2: Supervised Learning of Hash Function

Mapping all documents in the given corpus to binary codes 
does not completely solve the problem of semantic hash- 
ing, because we also need to know how to obtain the binary 
codes for query documents, i.e., new documents that have
not been seen before. This problem, called out-of-sample
extension in manifold learning, is often addressed using the
Nyström method [4][9]. However, calculating the Nyström
extension of a new document is as computationally expensive as
an exhaustive similarity search over the corpus (that may
contain millions of documents), which makes it impractical for
semantic hashing. In LPI [17][16], LapEig [3] is extended to
deal with new samples by approximating a linear function
to the embedding of LapEig. However, the computational
complexity of LPI is very high because its learning algorithm
involves eigen-decompositions of two large dense matrices. It
is infeasible to apply LPI if the given training corpus is large.
In SpH [48], new samples are handled by utilising the latest 
results on the convergence of graph Laplacian eigenvectors 
to the Laplace-Beltrami eigenfunctions of manifolds. It can 
achieve both fast learning and fast prediction, but it relies 
on a very restrictive assumption that the data are uniformly 
distributed in a hyper-rectangle. 

Overcoming the limitations of the above techniques
[4][9][17][16][48], this paper proposes a novel method to
compute the binary codes for query documents by considering
it as a supervised^4 learning problem: we think of each bit
$y_i^{(p)} \in \{+1,-1\}$ in the binary code for document $\mathbf{x}_i$ as a
binary class label (class-"on" or class-"off") for that
document, and train a binary classifier $y^{(p)} = f^{(p)}(\mathbf{x})$ on the
given corpus that has already been "labelled" by the above
binarised-LapEig method; then we can use the learned binary
classifiers $f^{(1)}, \ldots, f^{(l)}$ to predict the $l$-bit binary code
$y^{(1)}, \ldots, y^{(l)}$ for any query document $\mathbf{x}$. As mentioned in the
previous section, different bits $y^{(1)}, \ldots, y^{(l)}$ in the generated
binary codes are uncorrelated. Hence there is no redundancy
among the binary classifiers $f^{(1)}, \ldots, f^{(l)}$, and they can also
be trained independently.

In this paper, we choose to use the Support Vector Machine
(SVM) [23][36] algorithm to train these binary classifiers.
SVM in its simplest form, linear SVM $f(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}^T\mathbf{x})$,
consistently provides state-of-the-art performance for text
classification tasks [1UII22IH5]. Given the documents $\mathbf{x}_1, \ldots, \mathbf{x}_n$
together with their self-taught binary labels for the p-th bit
$y_1^{(p)}, \ldots, y_n^{(p)}$, the corresponding linear SVM can be trained
by solving the following quadratic optimisation problem

$$\arg\min_{\mathbf{w},\, \xi_i \ge 0} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + \frac{C}{n}\sum_{i=1}^{n}\xi_i \qquad (5)$$

$$\text{subject to} \quad \forall_{i=1}^{n}: \; y_i^{(p)}\,\mathbf{w}^T\mathbf{x}_i \ge 1 - \xi_i$$

A notable advantage of using SVM classifiers here is that 
we can easily achieve non-linear mappings if necessary by 
plugging in non- linear kernels [36], though we do not explore 
this potential in this paper. 
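To make the per-bit training step concrete, here is a small pure-Python sketch. It does not use LIBLINEAR: it swaps in plain full-batch subgradient descent on the unconstrained hinge-loss form of the linear SVM objective, 0.5*||w||^2 + (C/n) * sum_i max(0, 1 - y_i w.x_i). The function names, toy data, learning rate and epoch count are all illustrative assumptions.

```python
def train_linear_svm(X, y, C=10.0, lr=0.01, epochs=500):
    """Minimise 0.5*||w||^2 + (C/n) * sum_i max(0, 1 - y_i * w.x_i)
    by full-batch subgradient descent (no bias term, as in f(x) = sgn(w.x))."""
    n, m = len(X), len(X[0])
    w = [0.0] * m
    for _ in range(epochs):
        grad = list(w)  # gradient of the regulariser 0.5*||w||^2
        for x_i, y_i in zip(X, y):
            margin = y_i * sum(w_j * x_j for w_j, x_j in zip(w, x_i))
            if margin < 1:  # hinge loss is active for this document
                for j in range(m):
                    grad[j] -= (C / n) * y_i * x_i[j]
        w = [w_j - lr * g_j for w_j, g_j in zip(w, grad)]
    return w


def predict_bit(w, x):
    """Predicted label (+1 or -1) of one bit for document vector x."""
    return 1 if sum(w_j * x_j for w_j, x_j in zip(w, x)) > 0 else -1


# Toy documents with the "self-taught" labels of one bit.
X = [[2.0, 0.5], [1.5, 0.2], [3.0, 1.0], [0.5, 2.0], [0.2, 1.5], [1.0, 3.0]]
y = [1, 1, 1, -1, -1, -1]
w = train_linear_svm(X, y)
print([predict_bit(w, x) for x in X])
```

One such weight vector would be trained for each of the bits, independently of the others, using that bit's column of labels.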

3.3 Summary of Approach 

We name the above proposed two-stage approach
Self-Taught Hashing (STH). In this paper, we choose
binarised-LapEig [3] for the unsupervised learning stage and
linear-SVM [23][36] for the supervised learning stage, but obviously
it is possible to use other machine learning algorithms.

The learning process of STH for a given corpus can be
summarised as follows.

1. unsupervised learning of binary codes:

• construct the k-nearest-neighbours graph for the
given corpus;

• embed the documents in an $l$-dimensional space
through LapEig (3) to get an $l$-dimensional
real-valued vector for each document;

• obtain an $l$-bit binary code for each document via
thresholding the above vectors at their median
point, and then take each bit as a binary class
label for that document;

2. supervised learning of hash function:

• train $l$ SVM classifiers (5) based on the given
corpus that has been "labelled" as above.

^4 Since in the second stage, the supervised learning algorithm
uses only the pseudo-labels input from the previous
unsupervised learning stage, the entire STH approach remains
unsupervised.

Let s denote the average number of non-zero features per
document. In the first stage, constructing the
k-nearest-neighbours graph takes $O(n^2 s + n^2 k)$ time using the selection
algorithm [7], solving the LapEig problem (3) takes $O(lnkt)$
time using the Lanczos algorithm [12] of t iterations (the
value of t is usually quite small), and the median-based
binarisation takes $O(ln)$ time again using the selection
algorithm [7]. In the second stage, thanks to the recent advances
in large-scale optimisation, each of the $l$ linear SVM
classifiers can be trained in $O(sn)$ time or even less [24][21], so
all training can be done in $O(lsn)$ time. Both the value of
$l$ and the value of k can be regarded as small constants, as
usually a short code length is desirable and just a few
nearest neighbours are needed. For example, $l \le 64$ and k = 25
in our experiments (see Section 4). Therefore the overall
computational complexity of the learning process is roughly
quadratic in the number of documents in the corpus while
linear in the average size of the documents in the corpus.

The prediction process of STH for a given query document
is simply to classify the query document using those
$l$ learned classifiers and then assemble the $l$ output binary
labels into an $l$-bit binary code. For linear SVM, classifying
a document only requires one dot-product operation
between two vectors, the aggregated support vector and the
document vector (with s' non-zero features), which can be
done quickly in $O(s')$ time. Therefore the overall
computational complexity of the prediction process for each query
document is linear in the size of the query document.
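The prediction step can be sketched as follows, assuming the query document is stored sparsely as a mapping from feature index to weight and that one weight vector per bit has already been learned. The function name `predict_code` and the toy weights are hypothetical.

```python
def predict_code(weight_vectors, doc):
    """Assemble a binary code (as an integer) for a sparse document
    given as {feature_index: value}. Each bit costs one sparse dot
    product, touching only the non-zero features of the document."""
    code = 0
    for w in weight_vectors:  # one weight vector per bit
        score = sum(value * w[j] for j, value in doc.items())
        code = (code << 1) | (1 if score > 0 else 0)
    return code


# Three hypothetical per-bit weight vectors over a 3-term vocabulary.
weights = [[1.0, -1.0, 0.0],
           [0.0, 0.5, -0.5],
           [-1.0, 0.0, 1.0]]
query_doc = {0: 2.0, 2: 1.0}  # only terms 0 and 2 occur in the query
print(format(predict_code(weights, query_doc), "03b"))  # 100
```

The resulting integer can be used directly as the key for the hash-table lookup of the query's Hamming ball.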

4. EXPERIMENTS 

We now empirically evaluate our proposed STH approach 
(using binarised-LapEig and linear-SVM), and compare its 
performance with binarised-LSI [34], LCH [50], and SpH [48] 
that represents the state of the art (see Section 2).

In the following STH experiments, the parameter k =
25 when constructing the k-nearest-neighbours graph for
LapEig^5, and the SVM implementation is from LIBLINEAR
[11] with the default parameter values^6.

4.1 Data 

We have conducted experiments on three publicly available
real-world text datasets: Reuters21578^7, 20Newsgroups^8,
and TDT2^9.



^5 In principle, the value of k for LapEig should be set to the
desired number of original nearest neighbours to be retrieved
(see Section 4.2).

^6 It is not necessary to fine-tune the SVM parameters (such
as C) because it has already worked very well with its default
parameter values.

^7 http://www.daviddlewis.com/resources/testcollections/reuters21578/
^8 http://people.csail.mit.edu/jrennie/20Newsgroups/
^9 http://www.nist.gov/speech/tests/tdt/tdt98/index.htm



The Reuters21578 corpus is a collection of documents that 
appeared on Reuters newswire in 1987. It contains 21578 
documents in 135 categories. In our experiments, those doc- 
uments appearing in more than one category were discarded, 
and only the largest 10 categories were kept, thus leaving us 
with 7285 documents in total. We use the ModApte split
here which gives 5228 (72%) documents for training and 
2057 (28%) documents for testing. 

The 20Newsgroups corpus was collected and originally 
used for document categorisation by Lang [27]. We use the 
popular 'bydate' version which contains 18846 documents, 
evenly distributed across 20 categories. The time-based split 
leads to 11314 (60%) documents for training and 7532 (40%) 
documents for testing. 

The TDT2 (NIST Topic Detection and Tracking) corpus 
consists of data collected during the first half of 1998 and 
taken from 6 sources, including 2 newswires (APW, NYT), 
2 radio programs (VOA, PRI) and 2 television programs 
(CNN, ABC). It consists of 11201 on-topic documents which 
are classified into 96 semantic categories. In our experi- 
ments, those documents appearing in more than one cate- 
gory were discarded, and only the largest 30 categories were 
kept, thus leaving us with 9394 documents in total. We ran- 
domly selected 5597 (60%) documents for training and 3797 
(40%) documents for testing. The averaged performance 
based on 10 such random selections is reported in this pa- 
per. 

All the above datasets have been pre-processed by stop-word
removal, Porter stemming, and TF-IDF weighting [5D].

For the purpose of reproducibility, we shall make the datasets 
and code used in our experiments publicly available at the 
first author's homepage upon paper publication. 

4.2 Evaluation 

Given a dataset, we use each document in the test set 
as a query to retrieve documents in the training set within 
a specified Hamming distance, and then compute standard 
retrieval performance measures: precision, recall, and their
harmonic mean ($F_1$ measure) [30].

$$\text{precision} = \frac{\text{the number of retrieved relevant documents}}{\text{the number of all retrieved documents}} \qquad (6)$$

$$\text{recall} = \frac{\text{the number of retrieved relevant documents}}{\text{the number of all relevant documents}} \qquad (7)$$

The reported performance scores in the following Section are 
averaged over all test queries in the dataset. 
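For a single query, these measures can be computed as in the sketch below. The function name is illustrative, and returning 0.0 when nothing is retrieved (or nothing is relevant) is an assumed convention that the text does not specify.

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall and their harmonic mean for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved relevant documents
    # Assumed convention: an empty denominator yields 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# 4 documents retrieved, 5 relevant, 2 in common.
p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 4, 5, 6, 7})
print(p, r, round(f1, 4))  # 0.5 0.4 0.4444
```

Averaging these per-query values over all test queries gives the kind of scores reported in the experiments.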

To determine whether a retrieved document is "relevant" 
to the given query document, we adopt the following two 
evaluation methodologies: 

1. retrieving original nearest neighbours: the k most
similar documents, i.e., nearest neighbours, in
the original vector space are considered as the
ground-truth relevant documents (k = 25 in our experiments);

2. retrieving same-topic documents: the documents
on the same topic, i.e., in the same category, are
considered as the ground-truth relevant documents.

The former methodology is used in [48]^10, while the latter
methodology is used in [34]. In our opinion, these two

^10 Actually only precision is used in [48], which is appropriate



methodologies emphasise different aspects of semantic hash- 
ing, and thus are suitable for different target IR applications. 
Therefore we use both of them in this paper. 

The absolute performance scores of STH are not as im- 
portant as how they compare with those of other semantic 
hashing techniques. As previously mentioned in Section 1,
if necessary, we can always spend a little extra time to filter
or re-rank the similarity search results based on their full
content, thereby achieving higher performance scores [42].

4.3 Results 

Figure [5] and Figure show the $F_1$ measure of STH for
retrieving original nearest neighbours and same-topic
documents respectively^11. We vary the code length from 4-bit
to 64-bit and also the Hamming ball radius (i.e., the
maximum Hamming distance between any retrieved document
and the query document) from 0 to 3, in order to show their
influences on the retrieval performance. It can be seen that
when the code length increases, STH is able to achieve a
higher $F_1$ measure (using a bigger Hamming ball radius).
However, longer binary codes demand more memory and 
a bigger Hamming ball radius requires more computation. 
The optimal trade-off between effectiveness and efficiency 
can be found by using a validation set of query documents. 

Figure 3] and Figure [S] compare STH with several other
typical semantic hashing methods in terms of their
precision-recall curves (created by varying the code length from
4-bit to 64-bit while fixing the Hamming ball radius at 1),
for retrieving original nearest neighbours and same-topic
documents respectively^12. It is clear that on all datasets
and under both evaluation methodologies, STH outperforms
binarised-LSI, LCH, and the state-of-the-art technique SpH
(that has already been shown to work much better than LSH
[1], stacked-RBM^13 [34] and boosting-SSC [38]). Using
16-bit codes and Hamming ball radius 1, the performance
improvements are all statistically significant (P value < 0.01)
according to the one-sided micro sign test (s-test) [35].

We think the superior performance of STH is due to two 
reasons: 

• the binary codes produced by binarised-LapEig effec- 
tively preserve the semantic similarity structure while 
maximising the entropy of the hash table; 

• the maximum-margin hyperplane produced by linear- 
SVM ensures high generalisation ability [36] . 

for their application of pattern recognition but obviously 
insufficient from the IR perspective. Due to this difference 
in performance measurement, their results are not directly 
comparable with ours. 

^11 The $F_1$ measure scores reported here should not be directly
compared with those in text categorisation papers, as we are
addressing a very different problem even though the same
datasets may have been used for experimentation.

^12 Although we could achieve higher retrieval performance by
utilising a bigger Hamming ball radius (e.g., 4), a large
number of binary codes (e.g., $\binom{64}{4} = 635376$ for 64-bit codes)
would need to be checked for each query and then the
efficiency gain brought by semantic hashing would diminish.

^13 For example, on the 20Newsgroups dataset, stacked-RBM
achieves a maximum of $F_1 = 0.276$ for retrieving same-topic
documents with 128-bit codes, while the same level of
performance can be obtained using our STH approach with just
8-bit codes.



We have also examined the approximation errors accumulated
in each step of STH (see Section 3). Our analysis
reveals that almost all approximation errors come from the
dimensionality reduction step using LapEig. However, LapEig
does work better than alternative methods (such as LSI)
for this step in our experiments, and it is a well-known
hard problem to accurately detect the (intrinsic)
dimensionality of data or effectively reduce the dimensionality of
data. The median-based binarisation and SVM-based
out-of-sample extension both work well, incurring little
approximation error.

The proposed STH approach (using binarised-LapEig and
linear-SVM) to semantic hashing is fast: on an ordinary
PC with an Intel Pentium 4 3.00GHz CPU and 2GB RAM,
our Matlab implementation of 64-bit STH takes approximately
0.0165 seconds per document for training (which is
about 10 times faster than SpH), and 0.0007 seconds per
document for prediction.

5. CONCLUSIONS 

The main contribution of this paper is a novel Self-Taught
Hashing (STH) approach to semantic hashing for fast simi- 
larity search. By decomposing the problem of finding small 
codes for large data into two stages — unsupervised learn- 
ing and supervised learning — we achieve great flexibility 
in choosing learning algorithms. Using binarised-LapEig for 
the first stage and linear-SVM for the second stage, STH sig- 
nificantly outperforms binarised-LSI, LCH, and the state-of- 
the-art technique SpH [48]. Since STH is a general learning
framework, it is promising to achieve even higher effective- 
ness and efficiency if more powerful unsupervised or super- 
vised learning algorithms can be employed. 

We shall apply this technique to text mining tasks (such 
as automated text categorisation [37]) and content-based
multimedia retrieval [28] in the near future. It would also 
be interesting to combine semantic hashing and distributed 
computing (e.g., [29]) to further improve the speed and scal- 
ability of similarity search. 